Analysis of LinkedIn Jobs and Skills (2024)¶

Vikram Rangarajan, Ved Karamsetty

GitHub Page

Webpage Version of Notebook

Vikram Rangarajan: Contributed to B-H (95%)

Ved Karamsetty: Contributed to A, B, C, E, F, G, H (75%)

Specific Contributions:

Vikram: Did most of the coding, since Ved's PC could not handle the dataset (it is MASSIVE) and Colab did not allow GPU use for extended periods of time (required for the ML section).

Ved: Chose dataset, gave inputs on things like how to preprocess data, what hypothesis tests to do, ML techniques, visualization, etc.

We did most of the code together in a call and screen share.

We, all team members, agree that the above information is true, and we stand by our stated contributions to this submitted project/final tutorial.

  • Vikram Rangarajan, Ved Karamsetty, 5/6/24

Sections:

  1. Introduction
  2. Data Curation
  3. Exploratory Data Analysis (EDA)
    • 3a. Data Preprocessing
    • 3b. Data Exploration and Summary Statistics
  4. Primary Analysis (ML)
  5. Visualization
  6. Insights and Conclusions

1. Introduction¶

The topic of our project is the analysis of LinkedIn job postings during 2024. The goal is to find insights on job locations, key skills required, and industry statistics. As college students, finding a job will be the next stage of our careers, and these insights could give us an upper hand when job searching. As Computer Science students in particular, we can look at job trends that relate to our own skill set.

Required Python libraries:

  • Pandas
  • numpy
  • pycountry
  • Jupyter for the notebook and IPython.display.display
  • plotly
  • fastparquet (optional, if you are using parquets)
  • statsmodels for hypothesis testing
  • scipy
  • scikit-learn, wordcloud, and sentence-transformers for the ML section
  • nbformat>=4.2.0 for plotly's mimetype rendering backend

2. Data Curation¶

We will be analyzing the 1.3M LinkedIn Jobs and Skills (2024) dataset from Kaggle. This dataset was web-scraped from LinkedIn itself, giving us accurate and up-to-date data to work with.

This dataset consists of 3 tables:

  • job_skills.csv
    • This contains the expected job skills for every job posting.
      • job_link: Foreign key, str
        • Link to the LinkedIn posting
      • skills: str
        • Comma separated string values. Ex: Customer Service, Communication, Typing
  • job_summary.csv
    • This contains the summary (job description) for each job posting, alongside the link to the posting, making 2 total columns.
      • job_link: Foreign key, str
        • Link to the LinkedIn posting
      • job_summary: str
        • Job summary, plain text.
  • linkedin_job_postings.csv
    • This is the most important table. It contains several columns:
      • job_link: Foreign key, str
        • Link to the LinkedIn posting
      • last_processed_time: str (to be converted to datetime)
        • Timestamp indicating the last time the job posting was processed
      • got_summary: str (to be converted to bool)
        • Indicates whether the job summary was successfully extracted or not
      • got_ner: str (to be converted to bool)
        • Indicates whether Named Entity Recognition (NER) was performed on the job posting or not
      • is_being_worked: str (to be converted to bool)
        • Indicates if the job posting is currently being worked on or not
      • job_title: str
        • Title of the job listing
      • company: str
        • Company name offering the job position
      • job_location: str
        • Location of the job position. This is a comma separated string, coming in many formats
      • job_type: str
        • Either remote, onsite, or hybrid.
      • job_level: str
        • Either Associate or Mid senior level.
      • first_seen: str (to be converted to datetime)
        • Timestamp indicating when the job posting was first seen
      • search_city: str
        • City used as a search criterion for collecting the job postings

A problem we faced was that these tables, especially job_summary.csv, are extremely large (>5GB), taking up a lot of disk space and a long time to load. To alleviate this, we converted the 3 CSVs to parquet files to reduce disk usage and loading times. Note that this step is not necessary; we just wanted to minimize loading times as much as possible while debugging. We have provided a script to do this conversion.
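
Our conversion script boils down to something like this (a minimal sketch; `csv_to_parquet` and `TABLES` are illustrative names, and writing parquet requires fastparquet or pyarrow to be installed):

```python
from pathlib import Path

import pandas as pd

# Table stems used in this notebook; only files that actually exist get converted
TABLES = ["linkedin_job_postings", "job_skills", "job_summary"]


def csv_to_parquet(stem: str) -> Path:
    """Convert <stem>.csv to <stem>.parquet, skipping missing or already-converted files."""
    csv_path = Path(f"{stem}.csv")
    parquet_path = csv_path.with_suffix(".parquet")
    if csv_path.is_file() and not parquet_path.is_file():
        pd.read_csv(csv_path).to_parquet(parquet_path)
    return parquet_path


for stem in TABLES:
    csv_to_parquet(stem)
```

Parquet is columnar and compressed, which is why it loads so much faster than the raw CSVs for wide tables like these.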

Unfortunately, geospatial (latitude & longitude) data is not included with the job postings, so another dataset is used to join each job posting with the correct latitude and longitude for geospatial plots. We chose the Countries States Cities Database repository: it is popular (starred 6.3k times), open source, and contains exactly what we need, namely the cities.csv file.

  • cities.csv
    • Contains city data for many cities worldwide.
      • id: int
        • Unused
      • name: str
        • Name of city
      • state_id: int
        • Unused
      • state_code: str
        • Unused
      • country_id: int
        • Unused
      • country_code: str (2 letters)
        • The ISO 3166-1 alpha-2 code of the country the city is in
      • country_name: str
        • The name of the country which the city is in
      • latitude: float
        • Latitude of city
      • longitude: float
        • Longitude of city
      • wikiDataId: str
        • Unused

3. Exploratory Data Analysis (EDA)¶

3a. Data Preprocessing¶

We need to do a lot of preprocessing on this dataset: converting data types, joining in data from other tables, and a good deal of string processing.

We will start off with our imports and preliminary settings

In [1]:
import pandas as pd
import pycountry
from IPython.display import display
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio
from collections import Counter
import os
from statsmodels.stats.proportion import proportions_ztest
from scipy import stats
import itertools
from sklearn import cluster, metrics, feature_extraction, decomposition
from wordcloud import WordCloud
from sentence_transformers import SentenceTransformer

# This allows for interactive maps when using jupyter nbconvert to create this HTML page
pio.renderers.default = "plotly_mimetype+notebook"
pd.options.plotting.backend = "plotly"

Now we will load linkedin_job_postings.csv.

In [2]:
if os.path.isfile("linkedin_job_postings.parquet"):
    postings = pd.read_parquet("linkedin_job_postings.parquet")
else:
    postings = pd.read_csv("linkedin_job_postings.csv")
display(postings)
job_link last_processed_time got_summary got_ner is_being_worked job_title company job_location first_seen search_city search_country search_position job_level job_type
0 https://www.linkedin.com/jobs/view/account-exe... 2024-01-21 07:12:29.00256+00 t t f Account Executive - Dispensing (NorCal/Norther... BD San Diego, CA 2024-01-15 Coronado United States Color Maker Mid senior Onsite
1 https://www.linkedin.com/jobs/view/registered-... 2024-01-21 07:39:58.88137+00 t t f Registered Nurse - RN Care Manager Trinity Health MI Norton Shores, MI 2024-01-14 Grand Haven United States Director Nursing Service Mid senior Onsite
2 https://www.linkedin.com/jobs/view/restaurant-... 2024-01-21 07:40:00.251126+00 t t f RESTAURANT SUPERVISOR - THE FORKLIFT Wasatch Adaptive Sports Sandy, UT 2024-01-14 Tooele United States Stand-In Mid senior Onsite
3 https://www.linkedin.com/jobs/view/independent... 2024-01-21 07:40:00.308133+00 t t f Independent Real Estate Agent Howard Hanna | Rand Realty Englewood Cliffs, NJ 2024-01-16 Pinehurst United States Real-Estate Clerk Mid senior Onsite
4 https://www.linkedin.com/jobs/view/group-unit-... 2024-01-19 09:45:09.215838+00 f f f Group/Unit Supervisor (Systems Support Manager... IRS, Office of Chief Counsel Chamblee, GA 2024-01-17 Gadsden United States Supervisor Travel-Information Center Mid senior Onsite
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1348449 https://www.linkedin.com/jobs/view/registered-... 2024-01-20 15:21:07.786118+00 t t f Registered Nurse (RN) #CP-RN-7998660 - 2411627... TravelNurseSource Providence, RI 2024-01-14 Fall River United States Nurse Supervisor Mid senior Onsite
1348450 https://www.linkedin.com/jobs/view/constructio... 2024-01-20 15:21:10.885264+00 t t f Construction Superintendent Jobot New Iberia, LA 2024-01-15 Lafayette United States Assistant Construction Superintendent Mid senior Onsite
1348451 https://www.linkedin.com/jobs/view/executive-c... 2024-01-21 07:40:00.304641+00 t t f Executive Chef, Operations Support NEXDINE Hospitality Riverhead, NY 2024-01-14 Eastport United States Chef Mid senior Onsite
1348452 https://www.linkedin.com/jobs/view/rn-register... 2024-01-21 00:38:39.816821+00 t t f RN- Registered Nurse, Analyst - - 23934913EXPP... TravelNurseSource Aurora, CO 2024-01-16 Colorado United States Occupational Analyst Mid senior Onsite
1348453 https://www.linkedin.com/jobs/view/on-demand-g... 2024-01-21 00:38:44.231492+00 t t f On-Demand: Guest Advocate (Cashier), General M... Target Culver City, CA 2024-01-12 Malibu United States Cashier Ii Mid senior Onsite

1348454 rows × 14 columns

Now we load job_skills.

In [3]:
if os.path.isfile("job_skills.parquet"):
    skills = pd.read_parquet("job_skills.parquet")
else:
    skills = pd.read_csv("job_skills.csv")
display(skills)
job_link job_skills
0 https://www.linkedin.com/jobs/view/housekeeper... Building Custodial Services, Cleaning, Janitor...
1 https://www.linkedin.com/jobs/view/assistant-g... Customer service, Restaurant management, Food ...
2 https://www.linkedin.com/jobs/view/school-base... Applied Behavior Analysis (ABA), Data analysis...
3 https://www.linkedin.com/jobs/view/electrical-... Electrical Engineering, Project Controls, Sche...
4 https://www.linkedin.com/jobs/view/electrical-... Electrical Assembly, Point to point wiring, St...
... ... ...
1296376 https://www.linkedin.com/jobs/view/community-a... Communication Skills, Time Management, Custome...
1296377 https://www.linkedin.com/jobs/view/sr-it-analy... Windows SQL, EDI X12, Edifecs Platform, Health...
1296378 https://www.linkedin.com/jobs/view/operations-... Adaptability, Communication, Digital Fluency, ...
1296379 https://www.linkedin.com/jobs/view/float-patie... CNA, EMT, BLS, Medical Assistant, CPCT, LPN, R...
1296380 https://www.linkedin.com/jobs/view/conductor-e... Customer Service, Driving, Loading, Unloading,...

1296381 rows × 2 columns

Same for job_summary.csv

In [4]:
if os.path.isfile("job_summary.parquet"):
    summary = pd.read_parquet("job_summary.parquet")
else:
    summary = pd.read_csv("job_summary.csv")
display(summary)
job_link job_summary
0 https://www.linkedin.com/jobs/view/restaurant-... Rock N Roll Sushi is hiring a Restaurant Manag...
1 https://www.linkedin.com/jobs/view/med-surg-re... Schedule\n: PRN is required minimum 12 hours p...
2 https://www.linkedin.com/jobs/view/registered-... Description\nIntroduction\nAre you looking for...
3 https://uk.linkedin.com/jobs/view/commercial-a... Commercial account executive\nSheffield\nFull ...
4 https://www.linkedin.com/jobs/view/store-manag... Address:\nUSA-CT-Newington-44 Fenn Road\nStore...
... ... ...
1297327 https://www.linkedin.com/jobs/view/roofing-sup... We are currently seeking experienced commercia...
1297328 https://www.linkedin.com/jobs/view/service-cen... Overview\nStable and growing organization\nCom...
1297329 https://www.linkedin.com/jobs/view/flight-qual... Rôle et responsabilités\nJob Description:\nFli...
1297330 https://www.linkedin.com/jobs/view/global-sour... Job Description\nAre You Ready to Make It Happ...
1297331 https://www.linkedin.com/jobs/view/research-as... Current Employees:\nIf you are a current Staff...

1297332 rows × 2 columns

We do the same for cities.csv. Here, we also remove periods and accented characters and perform some other cleaning.
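
The pandas normalization chain in the next cell is equivalent to this plain-Python helper (a sketch using the standard library's unicodedata; `strip_accents` is an illustrative name):

```python
import unicodedata


def strip_accents(name: str) -> str:
    """Remove periods and strip accents, mirroring the pandas .str chain below."""
    name = name.replace(".", "")
    # NFKD decomposes accented characters into base char + combining mark;
    # encoding to ASCII with errors="ignore" then drops the combining marks.
    return (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", errors="ignore")
        .decode("utf-8")
    )
```

For example, strip_accents("São Paulo") gives "Sao Paulo", and strip_accents("St. Gallen") gives "St Gallen", which is what lets us match city names against the LinkedIn location strings later.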

In [5]:
# https://github.com/dr5hn/countries-states-cities-database
cities = pd.read_csv("cities.csv")[
    ["name", "state_name", "country_code", "country_name", "latitude", "longitude"]
]

# Deletes .'s and replaces accented characters with standard alphabetical
cities["name"] = cities["name"].str.replace(".", "")
cities["name"] = (
    cities["name"]
    .str.normalize("NFKD")
    .str.encode("ascii", errors="ignore")
    .str.decode("utf-8")
)
print("Rows before dropping nulls and duplicates", cities.shape[0])
cities.drop_duplicates(["name", "state_name", "country_code"], inplace=True)
cities.dropna(inplace=True)
cities.reset_index(inplace=True, drop=True)
print("Rows after dropping nulls and duplicates", cities.shape[0])
print("Counts:")
print(cities.count())
display(cities[["name", "state_name", "country_code", "country_name"]].describe())
display(cities[["latitude", "longitude"]].describe())
display(cities)
Rows before dropping nulls and duplicates 150574
Rows after dropping nulls and duplicates 150497
Counts:
name            150497
state_name      150497
country_code    150497
country_name    150497
latitude        150497
longitude       150497
dtype: int64
name state_name country_code country_name
count 150497 150497 150497 150497
unique 131214 3487 197 197
top Merkez England US United States
freq 51 2918 19820 19820
latitude longitude
count 150497.000000 150497.000000
mean 31.565417 2.328172
std 22.795936 68.026230
min -75.000000 -179.121980
25% 18.993450 -58.963770
50% 40.680660 8.657070
75% 47.238680 27.732170
max 73.508190 179.466000
name state_name country_code country_name latitude longitude
0 Ashkasham Badakhshan AF Afghanistan 36.68333 71.53333
1 Fayzabad Badakhshan AF Afghanistan 37.11664 70.58002
2 Jurm Badakhshan AF Afghanistan 36.86477 70.83421
3 Khandud Badakhshan AF Afghanistan 36.95127 72.31800
4 Raghistan Badakhshan AF Afghanistan 37.66079 70.67346
... ... ... ... ... ... ...
150492 Redcliff Midlands Province ZW Zimbabwe -19.03333 29.78333
150493 Shangani Midlands Province ZW Zimbabwe -19.78333 29.36667
150494 Shurugwi Midlands Province ZW Zimbabwe -19.67016 30.00589
150495 Shurugwi District Midlands Province ZW Zimbabwe -19.75000 30.16667
150496 Zvishavane District Midlands Province ZW Zimbabwe -20.30345 30.07514

150497 rows × 6 columns

Now we merge all of our LinkedIn dataframes into one dataframe called df and do some more cleaning.

In [6]:
df = postings.merge(skills, how="inner", on="job_link").merge(
    summary, how="inner", on="job_link"
)
print("Rows before dropping nulls and duplicates", df.shape[0])
df.drop_duplicates("job_link", inplace=True)
df.dropna(inplace=True)
df.reset_index(inplace=True, drop=True)
print("Rows after dropping nulls and duplicates", df.shape[0])

# Keep only lowercase alphabetical characters; everything else becomes a space
df["job_skills"] = df["job_skills"].str.lower().replace("[^a-z]", " ", regex=True)

# We clean job_location to remove accented characters
df["job_location"] = (
    df["job_location"]
    .str.replace(".", "")
    .str.normalize("NFKD")
    .str.encode("ascii", errors="ignore")
    .str.decode("utf-8")
)

print("Counts:")
print(df.count())
display(df.describe())
display(df)
Rows before dropping nulls and duplicates 1296381
Rows after dropping nulls and duplicates 1294268
Counts:
job_link               1294268
last_processed_time    1294268
got_summary            1294268
got_ner                1294268
is_being_worked        1294268
job_title              1294268
company                1294268
job_location           1294268
first_seen             1294268
search_city            1294268
search_country         1294268
search_position        1294268
job_level              1294268
job_type               1294268
job_skills             1294268
job_summary            1294268
dtype: int64
job_link last_processed_time got_summary got_ner is_being_worked job_title company job_location first_seen search_city search_country search_position job_level job_type job_skills job_summary
count 1294268 1294268 1294268 1294268 1294268 1294268 1294268 1294268 1294268 1294268 1294268 1294268 1294268 1294268 1294268 1294268
unique 1294268 720667 1 1 1 564775 88932 28773 6 1018 4 1922 2 3 1285967 955934
top https://www.linkedin.com/jobs/view/account-exe... 2024-01-19 09:45:09.215838+00 t t f LEAD SALES ASSOCIATE-FT Health eCareers New York, NY 2024-01-14 North Carolina United States Account Executive Mid senior Onsite front counter drivethru outside order taker ... Dollar General Corporation has been delivering...
freq 1 573435 1294268 1294268 1294268 7315 40010 12561 458587 9485 1103592 19463 1153364 1283454 183 4565
job_link last_processed_time got_summary got_ner is_being_worked job_title company job_location first_seen search_city search_country search_position job_level job_type job_skills job_summary
0 https://www.linkedin.com/jobs/view/account-exe... 2024-01-21 07:12:29.00256+00 t t f Account Executive - Dispensing (NorCal/Norther... BD San Diego, CA 2024-01-15 Coronado United States Color Maker Mid senior Onsite medical equipment sales key competitors term... Responsibilities\nJob Description Summary\nJob...
1 https://www.linkedin.com/jobs/view/registered-... 2024-01-21 07:39:58.88137+00 t t f Registered Nurse - RN Care Manager Trinity Health MI Norton Shores, MI 2024-01-14 Grand Haven United States Director Nursing Service Mid senior Onsite nursing bachelor of science in nursing maste... Employment Type:\nFull time\nShift:\nDescripti...
2 https://www.linkedin.com/jobs/view/restaurant-... 2024-01-21 07:40:00.251126+00 t t f RESTAURANT SUPERVISOR - THE FORKLIFT Wasatch Adaptive Sports Sandy, UT 2024-01-14 Tooele United States Stand-In Mid senior Onsite restaurant operations management inventory ma... Job Details\nDescription\nWhat You'll Do\nAs a...
3 https://www.linkedin.com/jobs/view/independent... 2024-01-21 07:40:00.308133+00 t t f Independent Real Estate Agent Howard Hanna | Rand Realty Englewood Cliffs, NJ 2024-01-16 Pinehurst United States Real-Estate Clerk Mid senior Onsite real estate customer service sales negotiat... Who We Are\nRand Realty is a family-owned brok...
4 https://www.linkedin.com/jobs/view/registered-... 2024-01-21 08:08:19.663033+00 t t f Registered Nurse (RN) Trinity Health MI Muskegon, MI 2024-01-14 Muskegon United States Nurse Practitioner Mid senior Onsite nursing bsn medical license virtual rn nur... Employment Type:\nFull time\nShift:\n12 Hour N...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1294263 https://www.linkedin.com/jobs/view/registered-... 2024-01-20 15:21:07.786118+00 t t f Registered Nurse (RN) #CP-RN-7998660 - 2411627... TravelNurseSource Providence, RI 2024-01-14 Fall River United States Nurse Supervisor Mid senior Onsite registered nurse bls certification nursing c... TravelNurseSource is working with Adelphi Medi...
1294264 https://www.linkedin.com/jobs/view/constructio... 2024-01-20 15:21:10.885264+00 t t f Construction Superintendent Jobot New Iberia, LA 2024-01-15 Lafayette United States Assistant Construction Superintendent Mid senior Onsite construction management project planning est... Want to learn more about this role and Jobot? ...
1294265 https://www.linkedin.com/jobs/view/executive-c... 2024-01-21 07:40:00.304641+00 t t f Executive Chef, Operations Support NEXDINE Hospitality Riverhead, NY 2024-01-14 Eastport United States Chef Mid senior Onsite culinary chef director menu writing cycle o... NEXDINE HOSPITALITY one of the nation's fastes...
1294266 https://www.linkedin.com/jobs/view/rn-register... 2024-01-21 00:38:39.816821+00 t t f RN- Registered Nurse, Analyst - - 23934913EXPP... TravelNurseSource Aurora, CO 2024-01-16 Colorado United States Occupational Analyst Mid senior Onsite registered nurse analyst rn registered nurse... TravelNurseSource is working with Alliant Pers...
1294267 https://www.linkedin.com/jobs/view/on-demand-g... 2024-01-21 00:38:44.231492+00 t t f On-Demand: Guest Advocate (Cashier), General M... Target Culver City, CA 2024-01-12 Malibu United States Cashier Ii Mid senior Onsite customer service communication problem solvi... Description: Starting Hourly Rate / Salario po...

1294268 rows × 16 columns

In [7]:
del postings, summary, skills  # Clear memory

Now, we do datatype manipulation. We have to:

  • Convert t & f to python True and False (in the postings table)
  • Convert date strings to pd.datetime64 (postings table)
  • Turn comma separated strings into lists (skills and postings tables)
  • Take job location and pull out the city, state, and country for each posting, imputing values as needed
  • Join postings table with cities table to get latitudes and longitudes of every city (for geospatial visualization)
In [8]:
def t_f_to_bool(x):
    return x == "t"


def split_str(x):
    if isinstance(x, str):
        return x.split(", ")
    return []


job_level = {"Associate": 0, "Mid senior": 1}
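
As a quick sanity check, the helpers behave like this on edge cases (redefined here so the snippet stands alone):

```python
def t_f_to_bool(x):
    return x == "t"


def split_str(x):
    # Comma-separated string -> list; anything else (e.g. NaN) -> empty list
    if isinstance(x, str):
        return x.split(", ")
    return []


checks = [
    t_f_to_bool("t"),                          # "t" maps to True
    not t_f_to_bool("f"),                      # "f" maps to False
    split_str("A, B, C") == ["A", "B", "C"],   # splits on ", "
    split_str(float("nan")) == [],             # non-strings fall through safely
]
```

The isinstance guard matters because pandas represents missing strings as NaN (a float), which would crash a bare .split call.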

Postings table manipulation

In [9]:
# Turn t and f into True and False
for col in ["got_summary", "got_ner", "is_being_worked"]:
    df[col] = df[col].apply(t_f_to_bool)

# Convert strings to dates
df["last_processed_time"] = pd.to_datetime(
    df["last_processed_time"],
    format="ISO8601",
)
df["first_seen"] = pd.to_datetime(
    df["first_seen"],
    format="ISO8601",
)

df.head()
Out[9]:
job_link last_processed_time got_summary got_ner is_being_worked job_title company job_location first_seen search_city search_country search_position job_level job_type job_skills job_summary
0 https://www.linkedin.com/jobs/view/account-exe... 2024-01-21 07:12:29.002560+00:00 True True False Account Executive - Dispensing (NorCal/Norther... BD San Diego, CA 2024-01-15 Coronado United States Color Maker Mid senior Onsite medical equipment sales key competitors term... Responsibilities\nJob Description Summary\nJob...
1 https://www.linkedin.com/jobs/view/registered-... 2024-01-21 07:39:58.881370+00:00 True True False Registered Nurse - RN Care Manager Trinity Health MI Norton Shores, MI 2024-01-14 Grand Haven United States Director Nursing Service Mid senior Onsite nursing bachelor of science in nursing maste... Employment Type:\nFull time\nShift:\nDescripti...
2 https://www.linkedin.com/jobs/view/restaurant-... 2024-01-21 07:40:00.251126+00:00 True True False RESTAURANT SUPERVISOR - THE FORKLIFT Wasatch Adaptive Sports Sandy, UT 2024-01-14 Tooele United States Stand-In Mid senior Onsite restaurant operations management inventory ma... Job Details\nDescription\nWhat You'll Do\nAs a...
3 https://www.linkedin.com/jobs/view/independent... 2024-01-21 07:40:00.308133+00:00 True True False Independent Real Estate Agent Howard Hanna | Rand Realty Englewood Cliffs, NJ 2024-01-16 Pinehurst United States Real-Estate Clerk Mid senior Onsite real estate customer service sales negotiat... Who We Are\nRand Realty is a family-owned brok...
4 https://www.linkedin.com/jobs/view/registered-... 2024-01-21 08:08:19.663033+00:00 True True False Registered Nurse (RN) Trinity Health MI Muskegon, MI 2024-01-14 Muskegon United States Nurse Practitioner Mid senior Onsite nursing bsn medical license virtual rn nur... Employment Type:\nFull time\nShift:\n12 Hour N...

Here, we split job_location into a list of strings, as opposed to a comma separated string like it was before.

In [10]:
df["job_loc_list"] = df["job_location"].apply(split_str)

We will eventually drop any job locations with more than 3 or fewer than 1 elements. There are only 28 of them, which is negligible out of 1.3 million rows.

In [11]:
length_of_cities = df["job_loc_list"].map(len)
num_invalid = df["job_loc_list"][
    (length_of_cities > 3) | (length_of_cities < 1)
].shape[0]
num_invalid, round(num_invalid / df["job_loc_list"].shape[0] * 100, 2)
Out[11]:
(28, 0.0)
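
The drop itself could look like this (a toy sketch on made-up rows; the notebook applies the same kind of length mask to the full dataframe):

```python
import pandas as pd

# Toy data: one valid location list, one empty, one overly long
toy = pd.DataFrame(
    {"job_loc_list": [["San Diego", "CA"], [], ["a", "b", "c", "d"]]}
)
lengths = toy["job_loc_list"].map(len)
# Keep only rows whose location list has 1 to 3 elements
toy = toy[(lengths >= 1) & (lengths <= 3)].reset_index(drop=True)
```

Only the first row survives the mask; the empty and four-element lists are dropped.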

These are some dictionaries used to impute missing values. The first maps US state codes to full state names, and vice versa. There is also a mapping from every U.S. state to its most populous city, used when the city is missing. We got these values from Wikipedia.

In [12]:
us_states = {
    "AL": "Alabama",
    "AK": "Alaska",
    "AZ": "Arizona",
    "AR": "Arkansas",
    "CA": "California",
    "CO": "Colorado",
    "CT": "Connecticut",
    "DE": "Delaware",
    "FL": "Florida",
    "GA": "Georgia",
    "HI": "Hawaii",
    "ID": "Idaho",
    "IL": "Illinois",
    "IN": "Indiana",
    "IA": "Iowa",
    "KS": "Kansas",
    "KY": "Kentucky",
    "LA": "Louisiana",
    "ME": "Maine",
    "MD": "Maryland",
    "MA": "Massachusetts",
    "MI": "Michigan",
    "MN": "Minnesota",
    "MS": "Mississippi",
    "MO": "Missouri",
    "MT": "Montana",
    "NE": "Nebraska",
    "NV": "Nevada",
    "NH": "New Hampshire",
    "NJ": "New Jersey",
    "NM": "New Mexico",
    "NY": "New York",
    "NC": "North Carolina",
    "ND": "North Dakota",
    "OH": "Ohio",
    "OK": "Oklahoma",
    "OR": "Oregon",
    "PA": "Pennsylvania",
    "RI": "Rhode Island",
    "SC": "South Carolina",
    "SD": "South Dakota",
    "TN": "Tennessee",
    "TX": "Texas",
    "UT": "Utah",
    "VT": "Vermont",
    "VA": "Virginia",
    "WA": "Washington",
    "WV": "West Virginia",
    "WI": "Wisconsin",
    "WY": "Wyoming",
    "DC": "District of Columbia",
    "AS": "American Samoa",
    "GU": "Guam",
    "MP": "Northern Mariana Islands",
    "PR": "Puerto Rico",
    "UM": "United States Minor Outlying Islands",
    "VI": "US Virgin Islands",
}

us_states |= {v: k for k, v in us_states.items()}

# https://en.wikipedia.org/wiki/List_of_largest_cities_of_U.S._states_and_territories_by_population
state_cities = {
    "Alabama": "Huntsville",
    "Alaska": "Anchorage",
    "Arizona": "Phoenix",
    "Arkansas": "Little Rock",
    "California": "Los Angeles",
    "Colorado": "Denver",
    "Connecticut": "Bridgeport",
    "Delaware": "Wilmington",
    "Florida": "Jacksonville",
    "Georgia": "Atlanta",
    "Hawaii": "Honolulu",
    "Idaho": "Boise",
    "Illinois": "Chicago",
    "Indiana": "Indianapolis",
    "Iowa": "Des Moines",
    "Kansas": "Wichita",
    "Kentucky": "Louisville",
    "Louisiana": "New Orleans",
    "Maine": "Portland",
    "Maryland": "Baltimore",
    "Massachusetts": "Boston",
    "Michigan": "Detroit",
    "Minnesota": "Minneapolis",
    "Mississippi": "Jackson",
    "Missouri": "Kansas City",
    "Montana": "Billings",
    "Nebraska": "Omaha",
    "Nevada": "Las Vegas",
    "New Hampshire": "Manchester",
    "New Jersey": "Newark",
    "New Mexico": "Albuquerque",
    "New York": "New York City",
    "North Carolina": "Charlotte",
    "North Dakota": "Fargo",
    "Ohio": "Columbus",
    "Oklahoma": "Oklahoma City",
    "Oregon": "Portland",
    "Pennsylvania": "Philadelphia",
    "Rhode Island": "Providence",
    "South Carolina": "Charleston",
    "South Dakota": "Sioux Falls",
    "Tennessee": "Nashville",
    "Texas": "Houston",
    "Utah": "Salt Lake City",
    "Vermont": "Burlington",
    "Virginia": "Virginia Beach",
    "Washington": "Seattle",
    "West Virginia": "Charleston",
    "Wisconsin": "Milwaukee",
    "Wyoming": "Cheyenne",
    "District of Columbia": "Washington DC",
}

This monster of a code block is responsible for taking the job_location and pulling out the city, state, and country. It's complicated because the dataset stores locations in roughly 20 different formats, and this code tries to account for as many of them as possible.

It will also print out the invalid locations it encounters, or at least the most frequent ones.

In [13]:
valid_cities = set(
    zip(
        cities["name"],
        cities["state_name"],
        cities["country_code"],
    )
)
valid_countries = set(cities["country_code"])
us_cities = dict(
    zip(
        cities[cities["country_code"] == "US"]["name"],
        cities[cities["country_code"] == "US"]["state_name"],
    )
)
inv = Counter()


def to_columns(loc):
    city = state = country = None
    if len(loc) == 1:
        item = loc[0].replace(" Metropolitan Area", "")
        if item == "United States" or "New York" in item:
            city = "New York City"
            state = "New York"
            country = "US"
        if item not in us_cities:
            item = item.replace("St ", "Saint ")
        for us_city, us_state in us_cities.items():
            if us_city in item:
                city = us_city
                state = us_state
                country = "US"

    if len(loc) == 2:
        """
        Formats:
        - city, state (U.S.)
        - United Kingdom, Great Britain (impute city with London)
        - state, United States
        """

        temp_state = loc[1].replace(" Metropolitan Area", "")
        if temp_state in us_states:
            state = us_states[temp_state]
            city = loc[0].replace(" Metropolitan Area", "")
            country = "US"
            if city == state == "New York":
                city = "New York City"
            if city not in us_cities:
                city = city.replace("St ", "Saint ")
            if city not in us_cities:
                for us_city, us_state in us_cities.items():
                    if us_city in loc[0] and state == us_state:
                        city = us_city
        elif loc[1] == "United States" or loc[1] == "US":
            state = loc[0].replace(" Metropolitan Area", "")
            # Imputes with the largest city in the state
            city = state_cities[state]
            country = "US"
        elif loc[1] == "United Kingdom":
            state = loc[0].replace(" Metropolitan Area", "")
            # Impute with London
            city = "London"
            country = "GB"

        if city == "Washington" and state == "District of Columbia":
            city = "Washington DC"

    elif len(loc) == 3:
        """
        Formats:
        city, state, country
        """

        # Full country name to alpha 2 code. Ex: United States -> US
        country_obj = pycountry.countries.get(name=loc[2])
        if country_obj is not None:
            country = country_obj.alpha_2
        else:
            if loc[2] == "Taiwan":
                country = "TW"
            elif loc[2] == "Turkey":
                country = "TR"
            elif loc[2] == "Tanzania":
                country = "TZ"

        city = loc[0].replace(" Metropolitan Area", "")

        # Code to full name. Ex: AL -> Alabama
        if country == "US" and loc[1] in us_states.keys() and len(loc[1]) == 2:
            state = us_states[loc[1]]
        else:
            state = loc[1].replace(" Metropolitan Area", "")

    if (city, state, country) not in valid_cities:
        inv[", ".join(loc) + f", Split: {city}, {state}, {country}"] += 1
    return [city, state, country]


df[["city", "state", "country"]] = pd.DataFrame(
    list(df["job_loc_list"].map(to_columns))
)

# Prints the occurrences of invalid cities (ones that aren't in the cities dataset)
# This is due either to formatting errors or to small cities missing from the dataset
len(inv), sorted(inv.items(), key=lambda x: -x[1])[:10]
Out[13]:
(11173,
 [('West Midlands, England, United Kingdom, Split: West Midlands, England, GB',
   806),
  ('West Yorkshire, England, United Kingdom, Split: West Yorkshire, England, GB',
   411),
  ('Austin, Texas Metropolitan Area, Split: Austin, TX, US', 356),
  ('Charleston, South Carolina Metropolitan Area, Split: Charleston, SC, US',
   354),
  ('Brick, NJ, Split: Brick, New Jersey, US', 351),
  ('Little London, England, United Kingdom, Split: Little London, England, GB',
   346),
  ('University Park, PA, Split: University Park, Pennsylvania, US', 329),
  ('Hull, England, United Kingdom, Split: Hull, England, GB', 327),
  ('West Chester, OH, Split: West Chester, Ohio, US', 320),
  ('Scotland, United Kingdom, Split: London, Scotland, GB', 318)])
In [14]:
del valid_cities  # Clear memory

Now we finally merge the postings table with the cities dataset. We do a left join so that we can see which rows failed to match (by checking for null values).
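
As an aside, merge's indicator=True parameter gives another way to count rows that failed to match; a toy sketch with made-up data:

```python
import pandas as pd

left = pd.DataFrame({"city": ["San Diego", "Atlantis"]})
right = pd.DataFrame({"city": ["San Diego"], "latitude": [32.72]})

# indicator=True adds a "_merge" column: "both", "left_only", or "right_only"
merged = left.merge(right, how="left", on="city", indicator=True)
n_unmatched = int((merged["_merge"] == "left_only").sum())
```

Here "Atlantis" has no coordinates in the right table, so it is flagged "left_only", which is exactly the null-latitude check we do below expressed another way.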

In [15]:
df = df.merge(
    cities,
    how="left",
    left_on=["city", "state", "country"],
    right_on=["name", "state_name", "country_code"],
)
df
Out[15]:
job_link last_processed_time got_summary got_ner is_being_worked job_title company job_location first_seen search_city ... job_loc_list city state country name state_name country_code country_name latitude longitude
0 https://www.linkedin.com/jobs/view/account-exe... 2024-01-21 07:12:29.002560+00:00 True True False Account Executive - Dispensing (NorCal/Norther... BD San Diego, CA 2024-01-15 Coronado ... [San Diego, CA] San Diego California US San Diego California US United States 32.71571 -117.16472
1 https://www.linkedin.com/jobs/view/registered-... 2024-01-21 07:39:58.881370+00:00 True True False Registered Nurse - RN Care Manager Trinity Health MI Norton Shores, MI 2024-01-14 Grand Haven ... [Norton Shores, MI] Norton Shores Michigan US Norton Shores Michigan US United States 43.16890 -86.26395
2 https://www.linkedin.com/jobs/view/restaurant-... 2024-01-21 07:40:00.251126+00:00 True True False RESTAURANT SUPERVISOR - THE FORKLIFT Wasatch Adaptive Sports Sandy, UT 2024-01-14 Tooele ... [Sandy, UT] Sandy Utah US Sandy Utah US United States 40.59161 -111.88410
3 https://www.linkedin.com/jobs/view/independent... 2024-01-21 07:40:00.308133+00:00 True True False Independent Real Estate Agent Howard Hanna | Rand Realty Englewood Cliffs, NJ 2024-01-16 Pinehurst ... [Englewood Cliffs, NJ] Englewood Cliffs New Jersey US Englewood Cliffs New Jersey US United States 40.88538 -73.95236
4 https://www.linkedin.com/jobs/view/registered-... 2024-01-21 08:08:19.663033+00:00 True True False Registered Nurse (RN) Trinity Health MI Muskegon, MI 2024-01-14 Muskegon ... [Muskegon, MI] Muskegon Michigan US Muskegon Michigan US United States 43.23418 -86.24839
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1294263 https://www.linkedin.com/jobs/view/registered-... 2024-01-20 15:21:07.786118+00:00 True True False Registered Nurse (RN) #CP-RN-7998660 - 2411627... TravelNurseSource Providence, RI 2024-01-14 Fall River ... [Providence, RI] Providence Rhode Island US Providence Rhode Island US United States 41.82399 -71.41283
1294264 https://www.linkedin.com/jobs/view/constructio... 2024-01-20 15:21:10.885264+00:00 True True False Construction Superintendent Jobot New Iberia, LA 2024-01-15 Lafayette ... [New Iberia, LA] New Iberia Louisiana US New Iberia Louisiana US United States 30.00354 -91.81873
1294265 https://www.linkedin.com/jobs/view/executive-c... 2024-01-21 07:40:00.304641+00:00 True True False Executive Chef, Operations Support NEXDINE Hospitality Riverhead, NY 2024-01-14 Eastport ... [Riverhead, NY] Riverhead New York US Riverhead New York US United States 40.91704 -72.66204
1294266 https://www.linkedin.com/jobs/view/rn-register... 2024-01-21 00:38:39.816821+00:00 True True False RN- Registered Nurse, Analyst - - 23934913EXPP... TravelNurseSource Aurora, CO 2024-01-16 Colorado ... [Aurora, CO] Aurora Colorado US Aurora Colorado US United States 39.72943 -104.83192
1294267 https://www.linkedin.com/jobs/view/on-demand-g... 2024-01-21 00:38:44.231492+00:00 True True False On-Demand: Guest Advocate (Cashier), General M... Target Culver City, CA 2024-01-12 Malibu ... [Culver City, CA] Culver City California US Culver City California US United States 34.02112 -118.39647

1294268 rows × 26 columns

Below are examples of rows that failed the merge; these are the job_location formats that we would need to handle in the to_columns() function.

In [16]:
# Invalid merge rows
m = df[df["latitude"].isnull()]
print(m["country"].value_counts())
m
country
US    41903
GB    15871
CA     5395
AU     2335
MX      644
FR       29
MY       25
AE       10
IN       10
SA        9
PH        6
AR        5
ES        5
IT        5
BR        4
GR        4
IE        4
BE        3
CN        3
CH        3
HU        3
TR        3
SE        2
DK        2
CM        2
EG        1
NG        1
DE        1
ID        1
TZ        1
JP        1
CZ        1
PL        1
TW        1
CL        1
Name: count, dtype: int64
Out[16]:
job_link last_processed_time got_summary got_ner is_being_worked job_title company job_location first_seen search_city ... job_loc_list city state country name state_name country_code country_name latitude longitude
8 https://www.linkedin.com/jobs/view/special-age... 2024-01-21 08:08:21.308995+00:00 True True False Special Agent: Law/Legal Background Federal Bureau of Investigation (FBI) Austin, Texas Metropolitan Area 2024-01-16 Austin ... [Austin, Texas Metropolitan Area] Austin TX US NaN NaN NaN NaN NaN NaN
17 https://www.linkedin.com/jobs/view/analyst-ass... 2024-01-21 02:01:15.820041+00:00 True True False Analyst, Associate Plan Configuration & Qualit... RITE AID Yocumtown, PA 2024-01-13 Lebanon ... [Yocumtown, PA] Yocumtown Pennsylvania US NaN NaN NaN NaN NaN NaN
25 https://uk.linkedin.com/jobs/view/property-man... 2024-01-21 06:01:13.568960+00:00 True True False Property Manager Knight Dragon Ltd Royal Borough of Greenwich, England, United Ki... 2024-01-14 Basildon ... [Royal Borough of Greenwich, England, United K... Royal Borough of Greenwich England GB NaN NaN NaN NaN NaN NaN
59 https://uk.linkedin.com/jobs/view/ct-technolog... 2024-01-21 06:01:14.166308+00:00 True True False CT Technologist Host Healthcare, Inc. Elsworth, England, United Kingdom 2024-01-16 Peterborough ... [Elsworth, England, United Kingdom] Elsworth England GB NaN NaN NaN NaN NaN NaN
104 https://uk.linkedin.com/jobs/view/corporate-re... 2024-01-21 07:43:54.617770+00:00 True True False Corporate Real Estate Manager (f/m/div) Siemens Gamesa Hull, England, United Kingdom 2024-01-17 Kingston upon Hull ... [Hull, England, United Kingdom] Hull England GB NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1294185 https://www.linkedin.com/jobs/view/mascot-at-w... 2024-01-20 05:13:01.735342+00:00 True True False Mascot Westgate Resorts River Ranch, FL 2024-01-14 Florida ... [River Ranch, FL] River Ranch Florida US NaN NaN NaN NaN NaN NaN
1294200 https://www.linkedin.com/jobs/view/fleet-maint... 2024-01-20 05:13:08.025382+00:00 True True False Fleet Maintenance Manager - Year Round in Echo... Energy Jobline Echo, UT 2024-01-17 Bountiful ... [Echo, UT] Echo Utah US NaN NaN NaN NaN NaN NaN
1294221 https://www.linkedin.com/jobs/view/scientist-i... 2024-01-21 08:28:03.629178+00:00 True True False Scientist IV - Clearance Preferred LMI Fort Detrick, MD 2024-01-14 Fairfield ... [Fort Detrick, MD] Fort Detrick Maryland US NaN NaN NaN NaN NaN NaN
1294231 https://ca.linkedin.com/jobs/view/academic-tut... 2024-01-20 13:46:05.388491+00:00 True True False Academic tutoring Private Tutoring Jobs Catalone ClickJobs.io Catalone, Nova Scotia, Canada 2024-01-16 Nova Scotia ... [Catalone, Nova Scotia, Canada] Catalone Nova Scotia CA NaN NaN NaN NaN NaN NaN
1294245 https://www.linkedin.com/jobs/view/ft-center-s... 2024-01-20 13:46:10.289403+00:00 True True False FT Center Store Manager (H) Food Lion Powhatan, VA 2024-01-14 Abingdon ... [Powhatan, VA] Powhatan Virginia US NaN NaN NaN NaN NaN NaN

69709 rows × 26 columns

In [17]:
df = df.dropna()
df.reset_index(inplace=True, drop=True)

We only lost about 6% of the original data, which isn't terrible but could definitely be improved.

In [18]:
print(
    "Percent of original dataset retained:", round(100 * df.shape[0] / 1300000, 3)
)
Percent of original dataset retained: 94.197

3b. Basic Data Exploration and Summary Statistics¶

Now thanks to the cities dataset, we can get a cool choropleth plot which you can interact with via plotly. It shows the number of jobs per country.

In [19]:
map_df = df.groupby(["country", "country_name"]).size()
map_df = pd.DataFrame(map_df).reset_index()
map_df.rename(
    {"country": "alpha_2", "country_name": "country", 0: "job count"},
    inplace=True,
    axis=1,
)
map_df["Log10(job count)"] = np.log10(map_df["job count"])
map_df["alpha_3"] = map_df["alpha_2"].map(
    lambda x: pycountry.countries.get(alpha_2=x).alpha_3
)
map_df

fig = px.choropleth(
    map_df,
    locations="alpha_3",
    hover_name="country",
    color="Log10(job count)",
    hover_data=["job count"],
    projection="orthographic",
    color_continuous_scale="viridis",
)
fig.update_layout(
    margin=dict(l=0, r=0, t=0, b=0, pad=4, autoexpand=True),
    width=600,
)
fig.update_coloraxes(colorbar_len=0.6)
fig.show()

Here, we do the same thing for cities, except with a geospatial scatter plot. Since there is an enormous number of cities, we limit the plot to cities with > 100 jobs so that the map isn't extremely laggy.

In [20]:
latlon = df.groupby(["latitude", "longitude", "city", "state", "country"]).size()
latlon = pd.DataFrame(latlon).reset_index().rename({0: "count"}, axis=1)
fig = px.scatter_geo(
    latlon.query("count > 100"),
    lat="latitude",
    lon="longitude",
    size="count",
    color="count",
    color_continuous_scale="viridis",
    hover_data=["city", "state", "country"],
    opacity=1,
    projection="orthographic",
)
fig.update_layout(
    margin=dict(l=0, r=0, t=0, b=0, pad=4, autoexpand=True),
    width=600,
)
fig.update_coloraxes(colorbar_len=0.6)
fig.show()
del latlon  # Clear memory
In [21]:
top3 = map_df.sort_values("job count", ascending=False).iloc[:3][
    ["country", "job count"]
]
display(top3)
top3 = set(top3["country"].values)
country job count
24 United States 1061554
8 United Kingdom 91708
3 Canada 45952

Recall that the job_level column consists of 2 values: Associate and Mid-Senior. We want to see which countries have a higher proportion of higher-level positions. Keep in mind that Mid-Senior == 1 and Associate == 0, so the mean of the column equals the proportion of mid-senior positions. To compare these, we use 2-sample z tests for proportions. However, since we're comparing 3 countries, we will need 3 pairwise tests. ANOVA should not be used here, since we are dealing with a binary (Bernoulli) variable, which cannot be assumed to be normally distributed (a proportion is simply the mean of a binary variable).

We will use $\alpha$=0.05
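A minimal sketch of the pooled 2-sample z test for proportions, using hypothetical counts (not the real data) and scipy only for the normal tail; this is essentially what statsmodels' proportions_ztest computes for two samples:

```python
import numpy as np
from scipy import stats

# Hypothetical mid-senior counts and totals for two "countries".
x1, n1 = 850, 1000
x2, n2 = 780, 1000

# Pooled two-sample z test for proportions, H1: p1 > p2.
p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se
p_value = stats.norm.sf(z)  # one-sided upper-tail p-value
print(z > 1.645 and p_value < 0.05)  # → True: reject H0 at alpha = 0.05
```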

In [22]:
num_mid_high = df.query("country_name in @top3").copy()
num_mid_high["job_level"] = num_mid_high["job_level"].apply(
    lambda x: job_level[x]
)  # Mid-Senior == 1, Associate == 0
num_mid_high = num_mid_high.groupby("country_name")["job_level"].sum().sort_values()
num_jobs = (
    df.query("country_name in @top3")
    .groupby("country_name")["job_level"]
    .count()
    .sort_values()
)
props = num_mid_high / num_jobs
fig = props.plot(kind="bar", backend="plotly")
fig.update_layout(
    yaxis_range=[0, 1],
    title="Sample Proportion of Mid-Senior Jobs by (Top 3) Countries",
)

This is supported by the $\chi^2$ test for the same data ($p<<\alpha$). We create the contingency table (and a proportion table just for visualization), and carry out the $\chi^2$ test.

In [23]:
contingency_table = pd.crosstab(
    df.query("country_name in @top3")["country_name"],
    df.query("country_name in @top3")["job_level"],
)
display(contingency_table)
display((contingency_table.T / contingency_table.sum(axis=1)).T)
display(
    contingency_table.plot(
        kind="bar", title="Frequency", backend="plotly", barmode="group"
    )
)
display(
    ((contingency_table.T / contingency_table.sum(axis=1)).T).plot(
        kind="bar", title="Probability distributions", backend="plotly", barmode="group"
    )
)
stats.chi2_contingency(contingency_table)
job_level Associate Mid senior
country_name
Canada 5134 40818
United Kingdom 8166 83542
United States 117017 944537
job_level Associate Mid senior
country_name
Canada 0.111725 0.888275
United Kingdom 0.089043 0.910957
United States 0.110232 0.889768
Out[23]:
Chi2ContingencyResult(statistic=395.8733435193279, pvalue=1.0894208642323106e-86, dof=2, expected_freq=array([[  4993.54309072,  40958.45690928],
       [  9965.78712056,  81742.21287944],
       [115357.66978871, 946196.33021129]]))

Since we rejected the null hypothesis, we follow up with post hoc tests: pairwise 2-sample z tests for the proportion of mid-senior jobs. These state with high confidence that the U.K.'s proportion exceeds both the U.S.'s and Canada's. However, the proportion of mid-senior jobs appears to be about the same for the U.S. and Canada ($p > 0.05$, so we can't reject the null hypothesis that they are equal).

In [24]:
print("USA > Canada")
print(
    proportions_ztest(
        num_mid_high[["United States", "Canada"]],
        num_jobs[["United States", "Canada"]],
        alternative="larger",
    )
)

print("UK > USA")
print(
    proportions_ztest(
        num_mid_high[["United Kingdom", "United States"]],
        num_jobs[["United Kingdom", "United States"]],
        alternative="larger",
    )
)

print("UK > Canada")
print(
    proportions_ztest(
        num_mid_high[["United Kingdom", "Canada"]],
        num_jobs[["United Kingdom", "Canada"]],
        alternative="larger",
    )
)
USA > Canada
(1.0005819489094683, 0.1585144803056488)
UK > USA
(19.790144761015508, 1.8100028997368962e-87)
UK > Canada
(13.432934673059961, 1.9386787397135185e-41)

Cool. Now we look at the job_type, which is Remote, Onsite, or Hybrid. We also compare these between our top 3 countries. We will do this using another $\chi^2$ test.
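Before building the real contingency table, here is a minimal $\chi^2$ independence test on a hypothetical 3×3 table (toy counts, not the real data):

```python
import numpy as np
from scipy import stats

# Hypothetical counts: rows = countries, columns = Hybrid / Onsite / Remote.
table = np.array([[30, 900, 20],
                  [15, 950, 10],
                  [50, 850, 40]])
res = stats.chi2_contingency(table)
print(res.dof)  # (rows - 1) * (cols - 1) = 4 degrees of freedom
print(res.pvalue < 0.05)  # → True: job type and country are not independent
```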

In [25]:
contingency_table = pd.crosstab(
    df.query("country_name in @top3")["country_name"],
    df.query("country_name in @top3")["job_type"],
)
display(contingency_table)
display((contingency_table.T / contingency_table.sum(axis=1)).T)
display(
    contingency_table.plot(
        kind="bar", title="Frequency", backend="plotly", barmode="group"
    )
)
display(
    ((contingency_table.T / contingency_table.sum(axis=1)).T).plot(
        kind="bar", title="Probability distributions", backend="plotly", barmode="group"
    )
)
stats.chi2_contingency(contingency_table)
job_type Hybrid Onsite Remote
country_name
Canada 326 45515 111
United Kingdom 774 90723 211
United States 4909 1053014 3631
job_type Hybrid Onsite Remote
country_name
Canada 0.007094 0.990490 0.002416
United Kingdom 0.008440 0.989259 0.002301
United States 0.004624 0.991955 0.003420
Out[25]:
Chi2ContingencyResult(statistic=330.6899939082778, pvalue=2.585770377761359e-70, dof=4, expected_freq=array([[2.30255457e+02, 4.55702718e+04, 1.51472761e+02],
       [4.59528801e+02, 9.09461718e+04, 3.02299443e+02],
       [5.31921574e+03, 1.05273556e+06, 3.49922780e+03]]))

$p<<\alpha$, so we reject the null hypothesis. The frequency of remote, hybrid, and onsite jobs is clearly different for each of the countries.

Now for the next tests, we will look at the companies. In particular, we want to see how large the companies are in each country, as large companies may indicate a strong and stable economy. To do this, we can take a look at the distribution of job counts among the top companies in each country. For example, we look at the distribution of the top 50 companies from the U.S., Canada, and U.K. The company names are not a numerical variable, but if we replace them with their rank (1st largest company, 2nd largest company, ...), then we can get ourselves a probability mass function.

We will then want to compare the 3 probability distributions. To do this, we use the 2 Sample Kolmogorov-Smirnov test for goodness of fit between each of the 3 countries.
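As a sanity check of what the 2-sample KS test detects, a toy comparison of a flat rank distribution against one peaked near rank 0 (synthetic data, not the real company counts):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(123)
flat = rng.integers(0, 50, size=5000)          # roughly uniform ranks 0..49
peaked = rng.geometric(p=0.08, size=5000) - 1  # heavy mass near rank 0

res = stats.ks_2samp(flat, peaked)
print(res.pvalue < 0.05)  # → True: the samples come from different distributions
```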

In [26]:
company_counts = (
    df.query("country_name in @top3")
    .groupby("country_name")["company"]
    .value_counts()
    .groupby("country_name")
    .head(50)
    .reset_index()
)
company_counts
Out[26]:
country_name company count
0 Canada CareerBeacon 1355
1 Canada ClickJobs.io 1197
2 Canada Canadian Armed Forces | Forces armées canadiennes 812
3 Canada Stantec 719
4 Canada Vancouver Coastal Health 663
... ... ... ...
145 United States WIS International 1839
146 United States The Home Depot 1789
147 United States Northrop Grumman 1746
148 United States Trinity Health 1744
149 United States Ross Stores, Inc. 1734

150 rows × 3 columns

In [27]:
fig = make_subplots(3, 1)

for i, country in enumerate(["United States", "Canada", "United Kingdom"]):
    fig.add_trace(
        go.Bar(
            text=company_counts.query("country_name == @country")["company"],
            y=company_counts.query("country_name == @country")["count"],
            name=country,
        ),
        row=i + 1,
        col=1,
    )
fig.update_layout(
    height=2000, title="Top 50 Company Job Distribution for each of the Top 3 Countries"
)
fig.update_xaxes(title="Nth Largest Company")
fig.update_yaxes(title="Number of Jobs")
fig

Scipy's 2-sample KS test needs samples, not PMFs. So, we expand each country's PMF into a sample (repeating each rank by its job count) and print out the summary statistics for each sample.
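The rank-repetition step can equivalently be written with np.repeat; a toy sketch with hypothetical counts:

```python
import numpy as np

# Toy counts: rank i (the i-th largest company) appears counts[i] times.
counts = np.array([5, 3, 2, 1])
sample = np.repeat(np.arange(len(counts)), counts)
print(sample.tolist())  # → [0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3]
```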

In [28]:
# Gets samples from each of the countries
pmfs = {}
for country in top3:
    data = []
    counts = company_counts.query("country_name == @country")["count"].reset_index()[
        "count"
    ]
    for i, x in enumerate(counts):
        data += [i] * x
    pmfs[country] = data
    sum_stats = pd.DataFrame({country: data}).describe()
    sum_stats.loc["skew"] = stats.skew(data)
    display(sum_stats)
United States
count 259497.000000
mean 13.188993
std 13.685892
min 0.000000
25% 2.000000
50% 8.000000
75% 22.000000
max 49.000000
skew 0.987267
Canada
count 13950.000000
mean 14.430323
std 13.962532
min 0.000000
25% 3.000000
50% 10.000000
75% 24.000000
max 49.000000
skew 0.891284
United Kingdom
count 23149.000000
mean 14.552162
std 14.149398
min 0.000000
25% 2.000000
50% 10.000000
75% 24.000000
max 49.000000
skew 0.848315
In [29]:
for c1, c2 in itertools.combinations(top3, 2):
    print("Comparing distributions of:", c1, ",", c2)
    print(stats.ks_2samp(pmfs[c1], pmfs[c2]))
Comparing distributions of: United States , Canada
KstestResult(statistic=0.061611845347954136, pvalue=3.9712986155353705e-44, statistic_location=2, statistic_sign=1)
Comparing distributions of: United States , United Kingdom
KstestResult(statistic=0.054722840970027664, pvalue=9.298692564618634e-56, statistic_location=3, statistic_sign=1)
Comparing distributions of: Canada , United Kingdom
KstestResult(statistic=0.05345704181311936, pvalue=4.628346419662773e-22, statistic_location=0, statistic_sign=-1)

All null hypotheses have been rejected ($p<<\alpha$), meaning that the PMFs are statistically different between the countries. According to the statistics and graphs above, Canada has the flattest distribution, while the U.S. and U.K. have a stronger peak near rank 0, meaning their postings are more concentrated in the very largest companies.

Conclusions of Exploratory Data Analysis¶
  • The vast majority of job postings on LinkedIn are located in the U.S., Canada, U.K., and Australia. We chose the top 3 to do hypothesis testing on.
  • Jobs are heavily concentrated in major cities, as seen by the geoscatter plot.
  • We have shown with statistical significance that the proportion of mid-senior jobs between the three countries (U.S., U.K., Canada) is as follows in ascending order: Canada, United States, United Kingdom. We also followed up with Post-Hoc testing to confirm this.
  • We have shown with statistical significance that the frequency distributions of job type (on-site, remote, hybrid) differ from country to country.
  • We have shown with statistical significance that the distributions of job concentration among the top 50 companies in each country differ; from the summary statistics and plots, Canada has more evenly sized companies in its top 50, whereas the U.S. and U.K. are more concentrated toward the very largest companies.

4. Primary Analysis¶

Now it's finally time to look at the text fields (the job skills; we ignore the summaries because analyzing them is too computationally heavy). Since we have a list of job skills for each posting, we want to group the postings into, say, different industries or fields. While the job_title column gives some indication of this, it is highly unstandardized, so supervised learning is impractical. We instead do unsupervised learning using K-Means clustering. To cluster the skill sentences, we first need to vectorize them, so we use SentenceTransformers with the all-MiniLM-L6-v2 transformer to turn each sentence into a 384-element vector.

In [30]:
transformer = SentenceTransformer("all-MiniLM-L6-v2", cache_folder=".")

This encoding step took about 20 minutes on a GTX 1080, so we save the result the first time it's generated and only have to compute it once. We would not recommend running this on a CPU... As you can see, for every one of the ~1.2 million postings, we have a 384-element vector associated with the skills for that posting.

In [31]:
if os.path.isfile("skills_encodings.npz"):
    encodings = np.load("skills_encodings.npz")["skills"]
else:
    encodings = transformer.encode(df.job_skills)
    np.savez_compressed("skills_encodings.npz", skills=encodings)

print(encodings.shape)
(1224559, 384)

To figure out the number of clusters for the K Means Clustering, we use both the elbow method and we look at the silhouette scores.
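As a quick illustration of what the silhouette score measures, two well-separated synthetic blobs (toy data, not our embeddings) should score close to 1, while overlapping clusters drag the score toward 0:

```python
import numpy as np
from sklearn import cluster, metrics

rng = np.random.default_rng(123)
# Two tight, well-separated 2-D blobs.
X = np.vstack([rng.normal(0, 0.2, (50, 2)), rng.normal(5, 0.2, (50, 2))])

model = cluster.KMeans(n_clusters=2, n_init=10, random_state=123).fit(X)
score = metrics.silhouette_score(X, model.labels_)
print(score > 0.9)  # → True: near-perfect separation
```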

In [32]:
del cities, length_of_cities, m
In [33]:
silhouette = []
wcss = []
clusters = [2, 3, 4, 5, 7, 9, 11, 13, 15]
for i in clusters:
    model = cluster.KMeans(
        n_clusters=i, max_iter=100, init="k-means++", random_state=123
    ).fit(encodings)
    silhouette.append(
        metrics.silhouette_score(encodings, model.labels_, sample_size=1000)
    )
    wcss.append(model.inertia_)
    print(f"{i} clusters done")

silhouette_data = pd.DataFrame(
    {"Number of Clusters": clusters, "Silhouette Coefficient": silhouette}
)
wcss_data = pd.DataFrame({"Number of Clusters": clusters, "WCSS": wcss})
px.line(silhouette_data, x="Number of Clusters", y="Silhouette Coefficient").show()
px.line(wcss_data, x="Number of Clusters", y="WCSS").show()
2 clusters done
3 clusters done
4 clusters done
5 clusters done
7 clusters done
9 clusters done
11 clusters done
13 clusters done
15 clusters done

The elbow method alone would have been inconclusive. However, combined with the silhouette scores, we can determine that 7 is a good number of clusters, so that's what we'll use. A higher silhouette score is desirable, but 2-4 clusters is too few for this many distinct groups of jobs; our goal is to get a dedicated cluster for tech jobs.

In [34]:
num_clusters = 7
labels = cluster.KMeans(
    n_clusters=num_clusters,
    max_iter=100,
    init="k-means++",
    random_state=123,
).fit_predict(encodings)

5. Visualization¶

Now we want to see the sentences in the different groups to view the clusters that the model formed, and what they could mean.

In [35]:
clustered_sentences = [[] for _ in range(num_clusters)]
for i in range(len(labels)):
    clustered_sentences[labels[i]].append(df.job_skills[i])

We print a Word Cloud, which shows the most frequent phrases or words as large text. We create one for every cluster so that we can see what each cluster indicates.

In [36]:
from collections import Counter

stop_words = [
    "to",
    "in",
    "and",
    "of",
    "a",
    "s",
    "skills",
    "management",
    "with",
]


for i in range(num_clusters):
    word_counter = Counter(" ".join(clustered_sentences[i]).split(" "))
    for stop_word in stop_words:
        if stop_word in word_counter:
            del word_counter[stop_word]
    fig = px.imshow(
        WordCloud(
            width=1920, height=1080, random_state=123
        ).generate_from_frequencies(word_counter)
    )
    fig.update_xaxes(showticklabels=False).update_yaxes(showticklabels=False)
    fig.update_layout(margin=dict(l=0, r=0, t=0, b=0)).show()

It seemed to work pretty well. We can describe each cluster relatively well.

  • Cluster 0: Therapist / Counselor
  • Cluster 1: Business Analytics
  • Cluster 2: General Inventory / Restaurant Jobs
  • Cluster 3: Computer Science Jobs
  • Cluster 4: Medical Jobs
  • Cluster 5: Business Management
  • Cluster 6: Retail jobs

We can now assign every job a predicted label, and store this relationship.

In [37]:
clusters = {
    0: "Therapist / Counselor",
    1: "Business Analytics",
    2: "General Inventory / Restaurant Jobs",
    3: "Computer Science Jobs",
    4: "Medical Jobs",
    5: "Business Management",
    6: "Retail jobs",
}
In [38]:
df["labels"] = labels.copy()
C:\Users\Vikram\AppData\Local\Temp\ipykernel_4580\3709420813.py:1: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

A quick preview to check whether it worked, by looking at the job_title of some of the tech jobs. It appears to have worked, since the job titles look like computing or mostly STEM-related positions!

In [39]:
df.query("labels == 3")["job_title"][:20]
Out[39]:
19                        Test Conductor - Space Systems
34          Sr Experience Design Manager, Learn and Help
47                                            Designer I
51     CIBC | Analyste Principal, Intégration des cli...
84              Senior Network Engineering Lead - Remote
98                      Lead Software Engineer, Back End
99               Founding Fullstack Engineer, AI Startup
101                              Design Strategy Manager
131                                  RESEARCH ANALYST II
141                                     Android Engineer
170                         Ad Operations Manager (Hulu)
171                     Senior Software Engineer, DevOps
173                     Senior Machine Learning Engineer
198    Contracts Lifecycle Management (CLM) Solutions...
211                         Senior Specialist Programmer
225                   IT / Business Framework Specialist
242                                        R&D Scientist
243                     Distinguished Applied Researcher
270                             Senior Flutter Developer
278         Principal Software Engineer, ML Accelerators
Name: job_title, dtype: object

Now we try topic modeling as well, since it may be a better method here. We also free up some memory. Note that this is something of a combination of steps 4 and 5 (primary analysis and visualization).

In [40]:
del encodings, labels, word_counter
In [41]:
cv = feature_extraction.text.CountVectorizer(
    analyzer="word", ngram_range=(2, 2), min_df=0.01, max_df=0.5
)

encodings = cv.fit_transform(df["job_skills"])
encodings.shape
Out[41]:
(1224559, 269)

This also takes a long time. Set the n_jobs parameter to just under the number of CPU cores you have to speed it up considerably. For reference, it took almost 8 minutes with n_jobs=14 on a Ryzen 7 1700.

In [42]:
labels = decomposition.LatentDirichletAllocation(
    n_components=7, random_state=123, n_jobs=14
).fit_transform(encodings)
labels.shape
Out[42]:
(1224559, 7)

Now we do the same grouping and word cloud as before

In [43]:
clustered_sentences = [[] for _ in range(num_clusters)]
for i in range(len(labels)):
    clustered_sentences[labels[i].argmax()].append(df.job_skills[i])
In [44]:
from collections import Counter

# Exclude these words
stop_words = [
    "to",
    "in",
    "and",
    "of",
    "a",
    "s",
    "skills",
    "management",
    "with",
]


for i in range(num_clusters):
    word_counter = Counter(" ".join(clustered_sentences[i]).split(" "))
    for stop_word in stop_words:
        if stop_word in word_counter:
            del word_counter[stop_word]
    fig = px.imshow(
        WordCloud(
            width=1920, height=1080, random_state=123
        ).generate_from_frequencies(word_counter)
    )
    fig.update_xaxes(showticklabels=False).update_yaxes(showticklabels=False)
    fig.update_layout(margin=dict(l=0, r=0, t=0, b=0)).show()
In [45]:
del encodings, labels, word_counter  # Clear memory again

As you can see, there is somewhat of a cluster for computing jobs (cluster 2), but it is not as pronounced as with the K-Means method. Therefore, we'll stick with the K-Means labels, which we've already stored in the dataframe.

6. Insights and Conclusions¶

We can now use these labels to analyze the software jobs. But before that, let's take a look at the distribution of jobs between the clusters.

In [46]:
fig = df["labels"].value_counts().rename(index=clusters).plot.bar()
fig.update_layout(
    title="Count of Jobs per Cluster", xaxis_title="Clusters", yaxis_title="Count"
)

So we have a solid 92k CS jobs to look at. Now we can run hypothesis tests on these between countries, similar to what we did earlier.

In [47]:
fig = (
    df.query("labels == 3 and country_name in @top3")["country_name"]
    .value_counts()
    .plot.bar()
)
fig.update_layout(
    title="CS Jobs by Country", xaxis_title="Top 3 Countries", yaxis_title="Count"
)

To do the hypothesis test, we want to look at the proportion of CS jobs in each of the top 3 countries and compare those using a $\chi^2$ test.

Here, the "True" label is CS jobs, while "False" are the non-CS jobs

In [49]:
contingency_table = pd.crosstab(
    df.query("country_name in @top3")["country_name"],
    df.query("country_name in @top3")["labels"] == 3,
)
display(contingency_table)
display((contingency_table.T / contingency_table.sum(axis=1)).T)
display(
    contingency_table.plot(
        kind="bar", title="Frequency", backend="plotly", barmode="group"
    )
)
display(
    ((contingency_table.T / contingency_table.sum(axis=1)).T).plot(
        kind="bar", title="Probability distributions", backend="plotly", barmode="group"
    )
)
stats.chi2_contingency(contingency_table)
labels False True
country_name
Canada 41674 4278
United Kingdom 85089 6619
United States 982405 79149
labels False True
country_name
Canada 0.906903 0.093097
United Kingdom 0.927825 0.072175
United States 0.925440 0.074560
Out[49]:
Chi2ContingencyResult(statistic=230.07197381576603, pvalue=1.0977556110093208e-50, dof=2, expected_freq=array([[ 42501.57848057,   3450.42151943],
       [ 84821.87411421,   6886.12588579],
       [981844.54740522,  79709.45259478]]))

At this sample size, it's not surprising that the p value is almost 0. Also, it shows that Canada ends up having the highest proportion of CS jobs on LinkedIn, with the U.S. in 2nd and the U.K. in third.

Final Conclusions¶

  • On LinkedIn, there are far more job postings in the U.S. than in any other country.
  • There is a higher proportion of mid-senior jobs in the U.K. compared to the U.S. and Canada.
  • There really aren't many remote or hybrid jobs on LinkedIn.
  • Computer Science jobs are proportionally most common in Canada, with the U.S. second and the U.K. third.
  • There are MANY medical job postings on LinkedIn, if that interests you.

We hope that these insights will help you in your job search. If the limitations above affect you, we recommend also checking other job sites (Indeed, etc.).